INFO: Encoding special characters
Applies To:
eWebEditPro
Summary:
The HTML specification defines special characters for a set of punctuation
symbols, accented letters, and a variety of non-Latin characters. Since the HTML
specification has changed over time, so has the support for special characters
in the browsers. For instance, Microsoft has defined a number of special
characters that would (in the past) only display in Internet Explorer on
Windows. They are extended characters that map to binary values 128 to
159. Depending on the version of your browser and operating system, and whether
it was made by Microsoft, the characters may appear as expected or as a "?" or
small rectangle. The W3C has now adopted most of these extended characters in
HTML 4, but they are mapped to different binary values.
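The remapping described above can be checked with a short Python sketch. It assumes Python's built-in `cp1252` codec as a stand-in for the Microsoft Windows extended character set; the variable names are illustrative only.

```python
# Byte 0x80 (decimal 128) is the Euro sign in the Windows-1252 code page,
# but HTML 4 refers to the same character by its Unicode value.
euro = bytes([0x80]).decode("cp1252")
euro_code_point = ord(euro)                     # Unicode value of the Euro sign
html4_reference = f"&#{euro_code_point};"       # HTML 4 character reference

# Under ISO-8859-1, bytes 128-159 are control codes, not printable characters.
as_latin1 = bytes([0x80]).decode("iso-8859-1")

print(euro_code_point, html4_reference, repr(as_latin1))
```

The same byte therefore yields a printable Euro sign under one charset and an invisible control code under another, which is one reason a browser may show a "?" or a small rectangle.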
Choosing the wrong font face can also prevent the character from displaying.
This is a common problem when copying from Microsoft Word, where many of the
special characters are in the Symbol font. If the Symbol font is not available
in the browser or not permitted in the editor, the character will display as
some other character.
For example, the Euro symbol was introduced by the European Union (EU) in the
late 1990s. Operating systems and browsers created earlier could not display
it.
Euro character (shown using an image) | [image] |
Euro in Verdana font (display depends on your browser) | € |
Euro in Courier New font (display depends on your browser) | € |
Entity Name | &euro; |
Microsoft Windows Extended Character Reference | &#128; |
HTML 4 Character Reference | &#8364; |
Characters with binary values 160 to 255 are also special characters in that
they display differently depending on the language (or locale) of the browser
and the charset attribute in the meta tag on the web page.
For example,
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">
The way characters are displayed can even be controlled from the browser. For
example, in IE 5, from the menu bar, select View > Encoding > language of
your choice. (You may need to install the IE option for international language
support.) In Netscape 4.7, select View > Character Set > language. The
possible languages are grouped as West European (Latin1), East European
(Latin2), Cyrillic, Arabic, Greek, Hebrew, and more. Each of these character
sets is defined by ISO 8859.
The ISO 8859 special characters are listed below. Change the encoding of your
browser to see the different ways the characters will be displayed.
¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯
° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½
¾ ¿
À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
à
á â ã ä å æ ç è é ê ë ì í î ï
ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ
ÿ
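The effect of changing the encoding can also be reproduced outside the browser. The following Python sketch decodes the same three byte values under three ISO 8859 character sets; the byte values were chosen arbitrarily for illustration.

```python
# Three bytes in the 160-255 range, interpreted under different charsets.
raw = bytes([0xA3, 0xB1, 0xE6])

decoded = {}
for charset in ("iso-8859-1", "iso-8859-2", "iso-8859-5"):
    decoded[charset] = raw.decode(charset)
    print(charset, decoded[charset])
```

The bytes never change; only the charset used to interpret them does, so Latin1 shows Western European symbols, Latin2 shows Central European letters, and ISO 8859-5 shows Cyrillic letters.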
In summary, the following factors affect how a special character is
displayed.
- Browser (Internet Explorer, Netscape, etc.)
- Version of the browser (3.0, 4.0, 5.0, etc.)
- Operating System (Windows 95, NT, 2000, Linux, Mac, etc.)
- Language of the O/S (English, Polish, Arabic, etc.)
- Font (Times, Arial, Helvetica, Symbol, etc.)
- Charset attribute in the meta tag (windows-1252, iso-8859-1, etc.)
- Encoding/Character Set setting of the browser (Western, Central European,
UTF-8, etc.)
Many Asian languages, such as Japanese, Korean, and Chinese, are represented
by two bytes instead of just one. The binary values for these characters are in
the range 256 to 65535. These are mapped as Unicode characters. eWebEditPro can
optionally convert these characters to their character reference or leave them
as double-byte binary Unicode values (which can be converted to UTF-8). For
example, a character whose binary value is 1234 will be converted to
"&#1234;".
eWebEditPro can be configured to represent extended and special characters in
a number of different ways. They are:
- Extended characters, special characters, and double-byte characters as
binary (Unicode, which can be converted to UTF-8).
- Extended and special characters as their entity name; double-byte
characters as their character reference.
- Extended characters, special characters, and double-byte characters as
their character reference.
- Extended characters as their entity name; special characters as binary;
double-byte characters as their character reference.
- Extended characters as HTML 4 character references; special characters as
binary; double-byte characters as their character reference.
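The two treatments of double-byte characters mentioned above (character reference versus binary Unicode converted to UTF-8) can be sketched in Python. The function name `to_char_refs` is hypothetical and is not part of eWebEditPro.

```python
def to_char_refs(text: str) -> str:
    """Replace every character above 255 with a decimal character reference."""
    return "".join(f"&#{ord(ch)};" if ord(ch) > 255 else ch for ch in text)


sample = "A" + chr(1234)             # a plain letter plus a double-byte character
as_refs = to_char_refs(sample)       # character reference form, pure ASCII
as_utf8 = sample.encode("utf-8")     # binary form, converted to UTF-8 bytes

print(as_refs, as_utf8)
```

The character reference form survives any 7-bit channel, while the UTF-8 form is shorter but requires every system that handles the content to agree on UTF-8.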
charencode Attribute
To configure eWebEditPro, set the charencode attribute of
the clean tag in the config.xml file.
For example,
<!-- values for charencode: utf-8, binary, entityname,
charref, special, latin -->
<clean enabled="true"
charencode="charref" ...>
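As a rough aid for checking a configuration before deployment, the following Python sketch reads the charencode attribute and verifies it against the values listed in the comment above. The `<config>` wrapper element and the `read_charencode` helper are assumptions made for illustration; the actual layout of config.xml may differ.

```python
import xml.etree.ElementTree as ET

# Values taken from the comment in the configuration example.
ALLOWED = {"utf-8", "binary", "entityname", "charref", "special", "latin"}


def read_charencode(config_xml: str) -> str:
    """Return the charencode value of the first <clean> element, or raise."""
    root = ET.fromstring(config_xml)
    clean = root if root.tag == "clean" else root.find(".//clean")
    if clean is None:
        raise ValueError("no <clean> element found")
    value = clean.get("charencode")
    if value not in ALLOWED:
        raise ValueError(f"unsupported charencode: {value!r}")
    return value


# Hypothetical, minimal configuration used only for this example.
example = '<config><clean enabled="true" charencode="charref" /></config>'
print(read_charencode(example))
```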
The values for charencode and their effect are shown in the
following table.
Value of charencode | Description | Sampler |
1. utf-8 or binary |
The sampler shows all the characters with binary values 128 to
255.
Characters 128-159 are extended characters. They are listed in two rows
that start with 80, which is the hexadecimal representation of 128, and
90.
Characters 160-255 are special characters. They are listed in several
rows that start with A0, which is the hexadecimal representation of 160,
through F0.
The sampler was displayed using IE 5.0 on English language Windows
(Latin1).
Double-byte characters are not shown, but would be stored as their binary
value. In View as HTML, they will always appear as their character
reference. When viewed in a browser, they will display as the character
only if the browser and operating system support that language.
WARNING: These characters will not display properly
unless the operating system supports them. Even if they display in WYSIWYG
mode, they will display as character references in View As HTML mode. If
stored in a database, the database must support double-byte Unicode or
UTF-8 characters. May not be supported in Netscape Navigator 4. |
2. entityname |
Extended characters are represented using their
entity name (e.g., &euro;) where possible.
Special characters are represented using their entity name (e.g.,
&nbsp; or &Agrave;).
Double-byte characters are not shown, but would be their character
reference. |
3. charref |
Extended characters are represented using their HTML
4 character reference (e.g., &#8364;).
Special characters are represented using their character reference
(e.g., &#160; or &#192;).
Double-byte characters are not shown, but would be their character
reference.
|
4. special |
Extended characters are represented using their
entity name (e.g., &euro;) where possible.
Special characters remain as binary, except the non-breaking space,
which is represented as &nbsp;.
Double-byte characters are not shown, but would be their character
reference.
|
5. latin |
Extended characters are represented using their HTML
4 character reference (e.g., &#8364;).
Special characters remain as binary, except the non-breaking space,
which is represented as &#160;.
Double-byte characters are not shown, but would be their character
reference.
|
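The behavior described in the table above can be approximated with a short Python sketch. This is not eWebEditPro's implementation; `encode_chars` and `EXTENDED` are hypothetical names, and the sketch assumes that Python's `html.entities.codepoint2name` table (the HTML 4.01 entity names) and its `cp1252` codec match the editor's notion of entity names and extended characters.

```python
from html.entities import codepoint2name

MODES = {"utf-8", "binary", "entityname", "charref", "special", "latin"}


def _extended_chars():
    """Characters that Windows-1252 assigns to byte values 128-159."""
    chars = set()
    for byte in range(0x80, 0xA0):
        try:
            chars.add(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # a few byte values in this range are unassigned
    return chars


EXTENDED = _extended_chars()


def _ref(ch):
    return f"&#{ord(ch)};"


def _name(ch):
    # Use the entity name where one exists, else the character reference.
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name else _ref(ch)


def encode_chars(text, charencode):
    if charencode not in MODES:
        raise ValueError(f"unknown charencode: {charencode!r}")
    if charencode in ("utf-8", "binary"):
        return text  # everything stays as binary Unicode
    out = []
    for ch in text:
        cp = ord(ch)
        if ch in EXTENDED:
            use_name = charencode in ("entityname", "special")
            out.append(_name(ch) if use_name else _ref(ch))
        elif 160 <= cp <= 255:  # special characters
            if charencode == "entityname":
                out.append(_name(ch))
            elif charencode == "charref":
                out.append(_ref(ch))
            elif cp == 160:  # the non-breaking space is never left as binary
                out.append("&nbsp;" if charencode == "special" else "&#160;")
            else:
                out.append(ch)
        elif cp > 255:  # double-byte characters
            out.append(_ref(ch))
        else:
            out.append(ch)
    return "".join(out)


sample = "\u20ac\u00c0\u00a0\u04d2"  # Euro, A-grave, non-breaking space, 1234
for mode in ("entityname", "charref", "special", "latin"):
    print(mode, encode_chars(sample, mode))
```

Extended characters without an HTML 4 entity name (for example, the Z with caron at byte 142) fall back to a character reference, which mirrors the "where possible" wording in the table.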
Choosing a Value
The best charencode value to use depends on the environment in which the
content will be viewed and on your preference for entity names versus
character references. If the environment (for example, a database) only
supports 7-bit ASCII characters, then either entityname or
charref must be used. Values of special or
latin produce smaller content because each special character requires
one byte instead of six or more. A value of
binary produces the smallest content when it consists mostly of
Asian characters (e.g., Japanese, Korean, Chinese) because each character
requires just two bytes instead of seven or more. Some sites convert Unicode
characters to the UTF-8 byte stream format. If your site consistently uses
UTF-8, use a value of utf-8.
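The size differences mentioned above can be verified with a small Python sketch. The sample characters (A-grave and a common kanji) are arbitrary choices, and UTF-16 is used here as an assumed stand-in for "double-byte binary Unicode".

```python
a_grave = "\u00c0"   # a special character (binary value 192)
kanji = "\u65e5"     # a double-byte character (binary value 26085)

sizes_special = {
    "binary": len(a_grave.encode("iso-8859-1")),   # one byte
    "charref": len(f"&#{ord(a_grave)};"),          # "&#192;"
    "entityname": len("&Agrave;"),
}

sizes_double_byte = {
    "binary": len(kanji.encode("utf-16-be")),      # two bytes
    "utf-8": len(kanji.encode("utf-8")),
    "charref": len(f"&#{ord(kanji)};"),            # "&#26085;"
}

print(sizes_special)
print(sizes_double_byte)
```

Note that the UTF-8 form of this character takes three bytes, one more than the two-byte binary form but still far fewer than the character reference.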
The following table lists recommended charencode values given certain
conditions.
Condition | Recommended charencode Value | Comments |
Database supports only 7-bit characters. |
entityname or charref |
Extended and special characters will be corrupted if wrong charencode
is selected. Choose between entityname or charref depending on your
preference for entity names or character references. |
Database supports only 8-bit characters. |
any except binary; use utf-8 only if your site uses UTF-8
consistently |
Some special and all double-byte characters will be corrupted. If you
use UTF-8, you must use it consistently on your site. |
Double-byte encoding, typically for an Asian language, and document
size is important. |
binary or utf-8 |
Database must support Unicode (double-byte) characters. Note: Unicode
is not the same as UTF-8. If you use UTF-8, you must use it
consistently on your site. |
Entity names are always preferred. |
entityname |
Extended and special characters will be their entity name. |
Entity names are preferred, but in a non-Western European language. |
special |
Special characters will be binary for different document encodings,
but extended characters will be their entity name. |
ISO-8859 (Latin) or windows charset encoding on document, but not
Latin1 (that is, not windows-1252 or iso-8859-1). |
latin or special |
Choose between latin or special depending on your preference for
entity names or character references for extended characters. |
Netscape Navigator 4 used for browsing |
charref |
Most extended and special characters will appear. Double-byte
characters do not appear if the browser or the operating system does not
support the language. If another charencode is selected, some extended and
special characters may appear as a "?" or their entity name. |
UTF-8 charset encoding on document. |
entityname or charref; use utf-8 only if your site uses UTF-8
consistently |
Special and double-byte characters will not display correctly as
binary. Choose between entityname or charref depending on your preference
for entity names or character references. If you use UTF-8, you must use
it consistently on your site. |
XML without XHTML DTD/Schema. |
charref; use utf-8 only if your site uses UTF-8 consistently |
XML only supports a very limited set of entity names unless the XHTML
(or other) DTD is provided. If you use UTF-8, you must use it consistently
on your site. |
Not sure. |
charref |
charref works with both UTF-8 encoding and XML parsers. It also gives
the best results in Netscape. If special characters always appear as West
European letters instead of the proper language, try
latin. |
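For the 7-bit condition in the table above, a quick way to confirm that content is safe to store is to check that it is pure ASCII. The Python sketch below uses the standard `xmlcharrefreplace` error handler, which behaves much like the charref setting for every non-ASCII character; the sample text is made up for illustration.

```python
import html

text = "Price: \u20ac5 \u00c0 la carte"   # contains a Euro sign and an A-grave

# Replace every non-ASCII character with its decimal character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
is_seven_bit = ascii_bytes.isascii()

# A browser (or html.unescape) turns the references back into characters.
round_trip = html.unescape(ascii_bytes.decode("ascii"))

print(ascii_bytes, is_seven_bit, round_trip == text)
```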
More Resources:
How to produce UTF-8
(Ektron Knowledge Base Article)
Character entity references in HTML 4
http://www.w3.org/TR/REC-html40/sgml/entities.html
The ISO 8859 Alphabet Soup
http://wwwwbs.cs.tu-berlin.de/user/czyborra/charsets/
Dan's Web Tips: Characters and Fonts
http://webtips.dan.info/char.html